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Abstract 

We present a technique for clustering categorical data by generating many dissimilarity ma¬ 
trices and averaging over them. We begin by demonstrating our technique on low dimensional 
categorical data and comparing it to several other techniques that have been proposed. Then we 
give conditions under which our method should yield good results in general. Our method ex¬ 
tends to high dimensional categorical data of equal lengths by ensembling over many choices of 
explanatory variables. In this context we compare our method with two other methods. Finally, we 
extend our method to high dimensional categorical data vectors of unequal length by using align¬ 
ment techniques to equalize the lengths. We give examples to show that our method continues 
to provide good results, in particular, better in the context of genome sequences than clusterings 
suggested by phylogenetic trees. 

Keywords: Categorical data; Ensembling methods; High dimension; Monte Carlo investigation. 

1 Introduction 

Clustering is a widely used unsupervised technique for identifying natural classes within a set of 
data. The idea is to group unlabeled data into subsets so the within-group homogeneity is relatively 
high and the between-group heterogeneity is also relatively high. The implication is that the groups 
should reflect the underlying structure of the data generator (DC). Clustering of continuous data 
has been extensively studied over many decades leading to several conceptually disjoint categories of 
techniques. However, the same cannot be said for the clustering of categorical data which, by contrast, 
has not been developed as extensively. 

We recall that categorical data are discrete valued measurements. Here, we will assume there 
are finitely many discrete values and that there is no meaningful ordering on the values or distance 
between them (i.e., nominal). For instance, if the goal is to cluster genomes, the data consist of strings 
of four nucleotides (A, T, C, G). Typically these strings are high dimensional and have variable length. 
There are many other similar examples such as clustering a population based on the presence/absence 
of biomarkers in specific locations. 

We first propose a technique for low dimensional equi-length categorical data vectors as a way to 
begin addressing the problem of discrete clustering in general. Second, we extend this basic technique 
to data consisting of high dimensional but equal length vectors. Then, we adapt our technique to 
permit unequal length vectors. Our approach is quite different from others and there are so many 
approaches that we defer discussion of a selection of them to Sec. 
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Heuristically, our basic method in low dimensions is as follows. Start with n vectors of dimension 
J, say Xi = {xii,...,Xjj) for i = where, for i = \,...,n and j = Xjj assumes values in a finite 

set Ay = fly} that can be identified with the first fly = #(Ay) natural numbers. For each j = 

we form a n x fly membership matrix Mj by treating each of the / variables separately. Then the 
combined membership matrix is M = of dimension n x Hyfly. Using Hamming distance 

on the rows of M gives a dissimilarity measure on the Xj’s, say, d. Now, we can use any hierarchical 
method to generate a clustering of the x,'s but we argue that adding a second stage by ensembling 
will improve performance. 

We add a second stage by first choosing K}, ~ DUnif[2, Vn] for b - and assume the Kj,'s are 

distinct. For each Ky, we use a hierarchical clustering technique based on d to produce a clustering 
with Ky clusters. Each of the B clusterings gives a combined membership matrix My = {Myl,...,MyJ) 
of the form M above. Again, we can use the Hamming distance on the rows as a dissimilarity, say D. 
As with d, any hierarchical clustering method can be used with the ensemble level dissimilarity D. 

In our work, we found that using d and D with the same linkage criterion typically gave better 
results than using d and D with different linkage criteria. Thus, we have six categorical clustering 
techniques: Three use the basic method based on d (i.e., first stage only) with single linkage (SL), 
complete linkage (CL), and average linkage (AL), while the other three use the ensemble method 
based on d and D (i.e.,first and second stages) with SL, CL, or AL. As a generality, in our comparisons 
we found that the ensembled clusterings under AL or CL typically gave the best results. 

We describe two paradigms in which the sort of ensembling described above is likely to be ben¬ 
eficial. One paradigm uses squared error loss and the familiar Jensen's inequality first used in bag¬ 
ging, an ensembling context for classification, see Breiman"] } 1996| |. The other paradigm uses a dis¬ 
similarity that parallels zero-one loss. Both descriptions rely on the concept of a true clustering 
Cy = of size fCf based on the population defined by P, the probability distribution 

generating the data. Lor our results, we assume Cj- is uniquely defined in a formal sense. 

We extend this basic method to high dimensional data vectors of equal length by partitioning the 
vectors into multiple subvectors of a uniform but smaller length, applying our basic method to each 
of them, and then combining the results with another layer of ensembling. 

Our second extension, to variable length categorical vectors, is more complicated but can be used 
to address the clustering of such vectors in genome sciences. It involves the concept of alignment. 
There are various forms of alignment (local, semi-global, global, etc.) but the basic idea is to force 
vectors of categorical data of differing lengths to have the same length by adding an extra symbol 
e.g., a 0, in strategic places. Then the resulting equal length vectors can be clustered as in our first 
extension. 

In the next section, we review the main techniques for clustering categorical data and indicate 
where they differ from the methods we have proposed. In Sec. [^we formally present our technique 
described above and provide a series of examples, both simulated and with real data, that verify our 
basic technique works well. We also present theoretical results that suggest our method should work 
well in some generality. In Sec. [^we present our first extension and in Sec. [^we present our second 
extension. In the final Sec. we discuss several issues related to the use of our method. 
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2 Clustering Techniques for Categorical Data 


In this section, we review six techniques for clustering categorical data, roughly in the order in which 
they were first proposed. 

2.1 K-modes 


fC-modes, see Huang (1998 ', is an extension of the familiar fC-means procedure for continuous data 


to categorical data. However, there are two essential differences. First, since the mean of a cluster 
does not make sense for categorical data, the modal value of a cluster is used instead; like the mean, 
the mode is taken componentwise. Second, in place of the Euclidean distance, fC-modes uses the 
Hamming distance, again componentwise. The initial modes are usually chosen randomly from the 
observations. As recognized in Huang ( 1998| l, this leads to instability and frequent inaccuracy in that 
fC-modes often gives locally optimal clusterings that are not globally optimal. 

There have been many efforts to overcome the instability and inaccuracy of fC-modes clustering 
with categorical data. Indeed, Huang ( 1998| suggested choosing the initial modes to be as far from 
each other as possible. Even if this were formalized, it is not clear how it would ensure the resulting 
fC-modes clustering would be accurate or stable. A different approach was taken in Wu at al. ( 2007| |. 
These authors used the density of a point x,- defined to be p(x, ) = d(x,',x,',) so that a point with 

high density should have many points relatively close to it. Cao et al. (2009| used a similar density 


at points. Eurther attempts to find good initial values are in Bai et al. (2012 ' who use the algorithm 
in Cao et al. (2009|l with a different distance function. Also, Khan and Ahmad (2013[l proposed 


a method for selecting the most relevant attributes and clustering on them individually. Then, they 
take representatives of the clusterings as the initial values for a K-modes clustering. 

Our method is based on ensembling so it is automatically stable. 


2.2 DBSCAN 

There have been several papers using the density-based algorithm, DBSCAN (Density-Based Spatial 


Clustering of Applications with Noise). Originally proposed for continuous data by Ester et al. 
( |1996| l, it extends to categorical variables because Hamming distance can be used to define a dissimi¬ 
larity matrix. DBSCAN defines a cluster to be a maximum set of density-connected points; two points 
are density connected if and only if the distance between them is less than a pre-assigned parame¬ 
ter. This means that every point in a cluster must have a minimum number of points within a given 


radius. There are other approaches that are similar to the DBSCAN such as Cactus, see Ganti et al. 
[ ( |1999 and Clicks, see |Zaki et al. (2007'. These and other methods are used in Andreopoulos and 
Wang (|2007 on ZOO and SOYBEAN data where it is seen they do not outperform K-modes. Hence, 


we do not use Cactus, Clicks or their variants here. Note that to use these methods one must choose a 
distance parameter that influences the size of the resulting clusters. 

Our method requires no auxilliary parameter and combines clusterings over randomly selected 
dimensions avoiding the question of 'density' in a discrete context. 


3 















































































2.3 ROCK 


Guha et al. (19991 presented a robust agglomerative hierarchical-clustering algorithm that can be 
applied to categorical data. It is referred to as ROCK (RObust Clustering using linKs). ROCK employs 
'links' in the sense that it counts the number of point-to-point hops one must make to form a path 
from one data point to another. Note that this relies on the number of points there are between two 
selected points, but not directly on the distance between them. In essence, ROCK iteratively merges 
clusters to achieve high within-cluster similarity. However, this de facto requires a concept of distance 
between points by way of hops - two hopes being twice as long as one hop. Our method does not rely 
on distance - it only counts the number of locations at which two strings differ. 


2.4 Hamming Distance (HD) 


Zhang et al. |2006| l use the Hamming distance (HD) from each observation to a reference position 
r. Then, they form the histogram generated by the values d{xj,r) and set up a hypothesis test. Let 
Ho be that there are no clusters in the data and take Hi to be the negation of this statement. Under 
Hq, we expect the histogram to be approximately normal, at least for well chosen r. Under the null 
hypothesis, lack of clustering, Zhang et al. ( |2006[ l estimates the frequency of the uniform 'HD'. This 
is compared to the HD vector with respect to r by way of a Chi-squared statistic. If the Chi-square 
statistic is too large, this is evidence against the null. So, the data that make the Chi-square large are 
removed and the process is repeated. This is an iterative method that relies on testing and the choice 
of r. It is therefore likely not stable unlike an ensemble method. 


2.5 Model based clustering (MBC) 

Model based clustering has attracted a lot of attention, see Fraley and Raftery ( 2002| . The main idea 
is that an overall mixture model for the observations can be identified and the subsets of data can be 
assigned to components in the mixture. So, the overall model is of the form 

K 


fix) = J^nkPk(xl6/,), 


k=l 


where 11^=1 = 1, 7T;t ^ 0 and the p]^’s are the components. The can be used as indicators of what 

fraction of data are from each component and the p^’s can be used to assign data to components. The 
likelihood of a mixture model is 


n K 

! = 1 k=l 
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where the parameters can be estimated using the Expectation-Maximization algorithm. When the 
data are categorical, the pj^’s should be categorical and Celeux and Govaert (19911 proposed 


j=l h=l 

where 6^ = {&[. ,h = l,...,mp; = Y.}iLi^k ~ ^ ^k the probability of category h 

of the variable ] in the cluster k. The proposed model is actually the product of A conditionally 
independent categorical distributions. Several versions of this method are implemented in the R 
package Rmixmod. 

Model based clustering should work well - and does work well for continuous data. For categorical 
data it is not at all clear when a mixture model holds or even is a good approximation. In fact, the 
mixture model likely does not hold very often and we would only expect good performance from 
MBC when it does. By contrast, ensembling should be perform reliably well over a larger range of 
DG's. 


2.6 Ensemble approaches 


Another approach is to regard the categorical data as the result of a clustering procedure. So, one can 
form a matrix in which the rows represent the n data points and each of the columns represent the 
values of an attribute of the data point. If the attributes are taken as cluster labels then the clusterings, 
one for each attribute, can be used to form a consensus clustering by defining a dissimilarity between 
data points using Hamming distance. This process is known as ensembling and was first used on 
categorical data by [He et al. (20051 via the techniques GSPA, HGPA and MGLA. From these. He et 


al. (20051 select the clustering with the greatest Average Normalized Mutual Information (ANMI) as 
the final result. This seeks to merge information among different clusterings. However the mutual 
information is measure of dependence and finding clusterings that are dependent is not the same as 
finding clusterings that are accurate. They apply their technique to several data sets but note that it 
does not perform well on unbalanced data like ZOO. The actual process of ensembling is reviewed in 


Strehl and Ghosh (2002 '. We find that ensembling over the dissimilarities is a better way to ensemble 


since it seems to give a more accurate assessment of the discrepancy between points. 


The idea of evidence accumulation clustering (FAG) is due to Fred and Jain (2005' and is an 
ensembling approach that initially was used for continuous data. The central idea is to create many 
clusterings of different sizes (by fC-means) that can be pooled via a 'co-association matrix' that weights 
points in each clustering according to their membership in each cluster. This matrix can then be 
easily modified to give a dissimilarity so that single-linkage clustering can be applied, yielding a final 
clustering. The first use of EAG on categorical data seems to be lam-On et al. | ( 2012| |. They used 
the fC-modes technique with random initializations for cluster centers to generate B base clusterings. 


Then they ensembled these by various methods. In Section 3.3 we use this approach for categorical 


data but using R-modes instead X-means (EN-KM). The same idea can be used to implement EAG on 
the MBG (EN-MBG). Again, our method ensembles over dissimilairties rather than clusters directly. 
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It seems that in catgorical clustering getting a good way to assess discrepancy between data points 
provides better solutions than trying to merge different clusterings directly. 


3 Basic Technique in Low Dimensions 

Here we present our ensemble technique for clustering categorical data, provide some justification 
for why it should perform well, and see how this is borne out in a few examples. 


3.1 Formal presentation 

To fix notation, we assume n independent and identical (IID) outcomes x„ z = !,...,« of a random 
variable X. The x,'s are assumed /-dimensional and written as (x,i,.. .,x,|) where each x,y is an element 
of a finite set Ay; the Ay's do not have to be the same but supy #(Ay) < co is required for our method. 

We denote a clustering of size K hj = (Cxi,...,C]fx); often we assume that for each K only 
one clustering will be produced. Without loss of generality, we assume each Ay is identified with fly 
natural numbers. 

We write the x,'s as the rows in an observation matrix and look at its columns: 

= { Cl ... Cj ). (1) 

That is, Cj — (Xiy,...,X,iy) € 

Now, for each cy form the fly x n membership matrix Mj with elements 


O 


^1 


1 


= 


0 


Xij=£ 

else 


where / = 1,..., fly and i = The combined membership matrix is 

M = ( Mf ... mJ ), 


( 2 ) 


the result of concatenating / matrices with dimensions n x Ui,...,n x Uj. Thus, M is « x A where 
A = L,y=i Wj\- ^0 use the transpose of the Mj’s so that each row will correspond to a subject. Rewriting 
M in terms of its rows i.e., in terms of subjects 1 through n, gives 




M = 


V 




Vn ) 


(3) 
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Now, consider the dissimilarity D = (d(r'v'^^))v,;<=i,...,M where 


/ 

d(Vy,v^) — ^ d(r’yir, 
T=1 


Here, b{a, b) = 0 ii a = b and one otherwise. Under this dissimilarity one can do hierarchical clustering 
under single linkage (HCSL), average linkage (HCAL), and complete linkage (HCCL). 

This gives three clustering techniques for categorical data. However, we do not advocate them 
as we have described because, as will be seen below, we can get better performance by ensembling 
and treating sets of outlying points representing less than a 100% of the data separately (see [Amiri et 
jal. I ( |2015| | for the basic technique to handle potential chaining problems and outliers under single 
linkage context). 

Our ensembling procedure is as follows. Choose one of HCSL, HCAL, or HCCL and draw ~ 
DUnif[2, ^Jn\ for b = For each value of b use the chosen method to find a clustering of size Ky. 

Then form the incidence matrix 


Wi2 

Wi2 


WlB 

Wnl 

^n2 


WfiB 


(4) 


Note that each column corresponds to a clustering and Wjy is the index of the cluster in the b-th 
clustering to which x; belongs. For any of the three linkages, we can find the dissimilarity matrix 


T = {dB{Xi,Xj))i^j=i ., 


(5) 


in which 


1 

dsi^i’^j) ~ g ^ 6(Wiy,Wjy). 
° h=l 


( 6 ) 


Note that we only compare the entries in a row within their respective columns. Thus, we never 
compare, say Wn with Wij but only compare Wn to entries of the form Wn for Now, we can use 


the same linkages on D as before (SL, AL, CL) with the treatment of small sets of outliers as in [Amiri 
et al. |2015|| to get a final clustering for any given K. (The choice of K can also be estimated by any of 


a number of techniques; see Amiri et al. (2015 ' for a discussion and comparison of such techniques.) 


3.2 Why and when ensembling works 

Consider two IID observations X and Y from the population defined by P assumed to have a well 
defined true clustering C-f. Write the cluster membership function as 

C'ir(x) = k X € Cxk- 


1 
























Now define the similarity of Cf (X) and C 7 -(Y) to be the indicator function 


4c.,x,,c.,n) = l‘ = 

|0 else 


This gives the (random) dissimilarity d = 1 - ;f(C'jr(X), C 7 ’(¥)). Clearly ^ (0,1) so that 

E{6) = p(Ct{X)^Ct(V)) = Ps. 

Empirically, we use data V to form a clustering denoted (p = (p(- j V) where (p{x \ V) is the index of the 
cluster in C{V) containing x. So, 

6 = l-x((p(^,V\(P(Y,V)\ 

in which X and Y are independent of the IID data V. Now, given B independent replications of the 
clustering technique we obtain 


So, we can form the dissimilarity 


JEN 


1 ® 
^ b=l 


(7) 


iid 


where k}, ~ DUnif(K£,Ki^), K£ and K^^ are the minimum and maximum acceptable sizes for a candi¬ 
date clustering. Note that the expression in ([^ reduces to ([^ when X = x, and Y = Xj. Thus, ([^ is a 
specific instance of ^ so when we do the hierarchical clustering in Sec. 3.1 we are using a dissimilar¬ 
ity of the same form as 0. Consequently, any statement involving Sen will lead to a corresponding 
statement for the ensembling in our clustering technique. 

The following proposition shows that the ensemble dissimilarity is closer to the true value of the 
dissimilarity on average than any individual member of the ensemble when the clusterings in the 
ensemble are 'mean unbiased'. Formally, a clustering method is mean unbiased if and only if the 
expectation of the empirical dissimilarity, conditional on the data and number of clusters, is and 
so independent of V and k},. That is, a clustering method is mean unbiased if and only if 


£(dfcJP,4) = p{Cr(X)^CT(Y)). 


( 8 ) 


The intuition behind this definition is that the probability that two data points are not in the same 
cluster is independent of /cj, for any chosen between k^ and /c„ for fixed data V. Note that, in 
this expression, the expectation is taken over the collection of possible clusterings holding V and ky 
fixed. This requires that some auxiliary random selections, such as the random selection of initial 
cluster centers in X-means, is built into the clustering technique. When a clustering technique is 
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mean unbiased we get that 


Pd = ^{^b\'^>h) = E{^EN\'^>h)> (9) 

and hence the ensembling preserves the mean unbiasedness. In this case, we can show that the en- 
sembling behaves like taking a mean of IID variables in the sense that the variance of the average 
of the 61 ,'s decreases as l/B. The interpretation of this is that the ensemble dissimilarity is a more 
accurate representation of the dissimilarity between any two data points than the dissimilarity from 
any individual clustering in the ensemble. 

Proposition 1. Assume all the clusterings are mean unbiased. Then, for each D and set /cj,...,/cg there 
is a B* so that B> B* implies 

P{ I >e I P,/ci,...,/cb) < mmP( \ 6 i^^-E 6 | >e | V,k},). (10) 

Proof. The expected dissimilarity for any clustering function (p is 

£(^| V,k) = P((P(X,V) = (P(Y,V)\V,k) = Pd(V,k), (11) 

i.e., possibly dependent on V and the number k of clusters in (p. Consequently, the conditional prob¬ 
ability 

P(\6-E(3\V)\>e\V,k)>0, 

is well defined for each V and k. Since ki, is chosen from a finite set of integers, 

mmP(|4^ -£(dj.JT>)| > e\V,kh) > 0, 

i.e., the right hand side of is strictly positive. 

Next observe that in general, for each k}, there is a P^{V,ki,) as in (HD- Since the clusterings in 
the ensemble are independent over reselected kfs, and the variance of a Bernoulli(p) is uniformly 
bounded by 1/4 for p € [0,1], we have that 

P{\^EN-P^En\> £\'P'>ki,...,kB) < 


—2p(\ ^EN-P^EN 1 ^ \T>,ki,...,kB) 

/ B B 2 

. k, 


b=l b=l 




b=l 

B 


TP,kh 


h=\ 


( 12 ) 
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Hence, for B sufficiently large, \\2) implies 


P(\^EN -P^en\ > e|P,/ci,...,/cB) < P(|6{, -£6{,| > e\V,k},). 


The difference between this bound and the statement of the result is that this bound uses Pdfjv and 
ESh in place of E6. Since S}, is mean unbiased, ([^ implies that for B large enough gives 

^’(|5£N--E5|>e|P,/ci,...,/cB)< <mmP(|d;ti--E5|>eP.4)- (13) 

□ 

Note that even though mean unbiasedness implies that Ed is independent of D and k},, and 6 ]^^ 
need not be. So, the conditioning in pH] ) cannot be dropped. If we take expectations over D for finite 
n, the proof of the proposition can be modified to give 


EvP{\^en-E^\ > e\V,ki,...,kB) < minEj,P^\6k^-Ed \> ep,/:;,). 

It can also be seen that if n increases and the clustering (p{- \ V) is consistent in the sense that (p(x \ 
V) Cx{x) pointwise in x, then for an nondecreasing sequence of B* = B*{n), the inequality continues 
to hold and the conditioning on V drops out. 

Instead of arguing that the mean dissimilarity from ensembling behaves well, one can argue that 
the actual clustering from an ensemble method such as we have presented is close to the population 
clustering. That is, it is possible to identify a condition that ensures ensemble clusterings are more ac¬ 
curate on average than the individual clusterings they combine. To see this, fix any sequence /c^,..., fcg 
and let Kf <k <Ku and write 


A 

EA{V,k) 

^EN 

Pi^EN I E>,/Ci,...,/Cb) 


P{Cr(X) = (/)(X,P)|P,fc), 

1 ® 

-Y^X(CT(X),cPk,(X,V)), 

° h=l 

1 ® 

-^P(Cr(X) = (/)fc^(X,I?)P,4). 


Since the /c;,'s are drawn IID from DUnif[X£,X,J, the condition 


1 


-‘^ii 

J^EA(V,i)> EA{V,k) 

i=K{ 


is equivalent to 


Pk^PiXEN I P^,ki,...,kB) > £(A I V,k). 


(14) 
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Moreover, 


v(E{AEMku-,kB)) < v(E{A\V,K)). 

That is, ( |14[ ) is equivalent to saying the ensemble clustering is more accurate on average and has 
a smaller variance than any of its constituents. Condition ( |14[ | defines a subset of P x on 

which ensembling good clusterings will improve the overall clustering. This is a parallel to the set 
defined in Breiman ( 1996| | and quantifies the fact that while ensembling generally gives better results 
than not ensembling, it is possible that ensembling in some cases can give worse results. That is, 
the best clustering amongst the B clusterings may be better than the ensemble clustering, but only 
infrequently. This is borne out in our numerical analyses below. 

Although Hamming distance is the natural metric to use with discrete, nominal data, it is easier to 
see that ensembling clusterings gives improved performance using squared error loss. Our result for 
clustering is modeled on Sec 4.1 Breiman'] ( 1996[ l for bagging classifiers. 

Let E-d be the expectation operator for the data generator that produced D. Then, the population- 
averaged clustering is 




(15) 


Implicitly, this assumes k is fixed and not random. This is an analog to the ensemble based cluster¬ 


ing presented in Sec. 3.1 Two important squared error 'distances' between the true clustering and 
estimates of it are 


APE{(P) = E{Cr-(p{JL,V)f, 

APE{<Pa) = E{Cr-4>A{^)?- 

A Jensen's inequality gives that the squared error loss from using is smaller than from using (p. 
Proposition 2. Eet X he an independent outcome from the same distribution as generated V. Then, 

Ee,{ape(4^))>ape(4^a). 


Proof. 


APE{(Pa) = £x(Ct(X)-<J)a(X))2 

= £x[£i)(Cr(X)-<J)(X,D))2] 

< £x£i)[(Cr(X)-<J)(X,P))2] 

= EvAPE{(P). 


□ 

Obviously, the inequality might be equality, or nearly so, in which case the averaging provides little 
to no gain. On the other hand, the averaging may make the distribution of (pA concentrate around 
Cx more tightly than the distribution of (p does, in which case the inequality would be strict and the 
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difference between the two sides could be large representing a substantial gain from the ensembling. 
In this sense, the ensembling may stabilize </) as a way to reduce its variance and hence improve its 
performance. Note that if the expectation in ( |15[ l were taken over K = k, (p, or both K and (p as well, the 
Jensen's inequality argument would continue to hold with replaced by E^E-d, or E^E-^E^. 

That is, more averaging can only improve the clustering. 


3.3 Numerical analysis 

To evaluate our proposed methodologies, we did two numerical analyses, one with simulated data 
and the other with real data. Technically, the real and simulated datasets are classification data but 
we applied our clustering techniques ignoring the class labels. Thus, the question is how well the 
clustering techniques replicated the known true classes. To assess this we defined the classification 
rate (CR) to be the proportion of observations from a data set that were correctly assigned to their 
cluster or class. To calculate the CR, we start by generating the clustering from the data. Then, we 
order the clusters according to how much they overlap with the correct clusters. For instance, if the 
estimated clusters are Ci,C 2 ;andC 3 and the correct clusters are Cj,C 2 ,and, C 3 then, if necessary, we 
relabel the estimated clusters so that Ci = argmaxjCjnCj | ; = 1 , 2 ,3}, C 2 = argmaxjCynC^ | ; = 1,2,3}, 
and C 3 = arg maxjCy n C 3 | j = 1,2,3|. In this way, we ensure that each of the estimated clusters overlap 
maximally, in a sequential sence, with exactly one of the true clusters. This can be done using the 
Hungarian algorithm, see Kuhn ( 2005| for details. 

The simulated data were generated as follows. First, we fixed the dimension / = 20. Then we 
generated sets of 20-dimensional data points of sizes 50 or 125 assuming different cluster structures. 
For instance, in the equi-sized cluster case with five clusters with n = 125 we generated five clusters 
of size 25. For each such cluster, we proceeded as follows. Choose ~ Unif(0.2,0.8) IID and 

choose ~ DUnif {3,20). Then, draw 25 values from B\n{ai,pi) where Ui is the number of values 
the variable giving the first dimension can assume. Do this again for each of B\n{ai,pj) for j = 2,...,5. 
Taken together these 125 values give the first entries for the 125 vectors of length 20 to be formed. We 
proceed similarly to obtain the second entries for the 125 vectors of dimension 20, but draw a new 
fl 2 ~ DUnif {3,20) independent of Ui. That is, we choose values pj ~ Unif(0.2,0.8) IID for j = 1,...,5 
and again generate 25 values from each Bin(fl 2 ,P;) for each j. Doing this 18 more times for IID ranges 
of length fl 3 = #{A^),...,a 2 o = #(^ 20 ) gives 125 vectors of length 20 that represent five clusters each 
of size 25. Doing this entire procedure 3000 times gives 3000 such data sets. We did this sort of 
procedure for various clustering patterns as indicated in Table e.g., taking nine values in place of 
25 values in D 2 for the first cluster. The patterns were chosen to test the methods systematically over 
a reasonable range of true clusterings based on the size of the clusters. 

For each simulation scenario, we applied 12 methods. The first four (K-modes, DBSCAN, ROCK, 
MBC) are as described in Subsecs. 2.1) 2.2[ 2.3 and 2.5 The next two are ensembled versions of 
K-modes and MBC, as described in Subsec. 2.6 The last six methods are agglomerative. The first 
three are linkage based using Hamming distance but not ensembled. The last three are the same 
except they have been ensembled. These six methods are also described in 3.1 We did not include 
HD from Subsec. 2.4 in this comparison because it required the estimate of a reference position r and 
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Table 1: Design of simulated data. 


Name 

K 

{ni,n2,...) 

n 

Di 

5 

(25,25,25,25, 25) 

125 

D 2 

5 

(9,29,29,29,29) 

125 

D 3 

5 

(10,10,35,35,35) 

125 

D 4 

5 

(10,10,10,47,48) 

125 

D 5 

5 

(10,10,10,10,85) 

125 

De 

5 

(10,25,25,25,40) 

125 

D 7 

5 

(10,10,30,30,45) 

125 

Ds 

5 

(10,10,10,35,60) 

125 

Dg 

5 

(10,10,25,40,40) 

125 

Dio 

2 

(25,25) 

50 

Dll 

2 

(15,35) 

50 


The top row labeled Dj shows the equi-sized clusters for n = 125. Data sets indicated by D2 through D5 are to be 
understood as contexts for testing how the methods perform when there are five clusters with two sizes^ one small and one 
large. Data sets indicated by Dg through Dg are to be understood as contexts for testing how the methods perform when 
there are five clusters of three sizes, small, medium, and large. Data sets indicated by Djq and Du are to be understood as 
contexts for testing how the methods perform when there are two clusters, one equisized and the other consisting of a 

small and a large cluster. 


the implementation we had did not provide one. 

The results are given in the Table The entries in bold are the maxima in their column. In some 
cases, we bolded two entries because they were so close as to be indistinguishable statistically. The 
entries with asterisks in each column are the next best methods, again we asterisked entries that 
seemed very close statistically. The pattern that emerges with great clarity is that HCAL and ENAL 
perform best. This is even borne out in the mean column - although we note that the mean is merely 
a summary of the CRs. It does not have any particular meaning because the cluster structures were 
chosen deterministically. The next best method after HCAL and ENAL is MBC. Note that Dg provided 
minimal discrimination over the techniques that were not bolded, i.e., not the best, and the mean 
column has the same property. HCCL and ENCL also performed better than the worst techniques for 
this collection of examples, but performed noticeably worse than the best. 


Table 2: Classification rates of twelve methods using simulated data. 

Data 


method 

Di 

D2 

D 3 

D4 

D 5 

De 

D? 

Ds 

Dg 

Dio 

Dll 

mean 

X-modes 

0.56 

0.45 

0.46 

0.46 

0.42 

0.46 

0.47 

0.45 

0.47 

0.81 

0.78 

0.53 

DBSCAN 

0.44 

0.38 

0.43 

0.51 

0.68 

0.41 

0.46 

0.54 

0.44 

0.65 

0.77 

0.52 

Rock 

0.71‘ 

0.471 

0.45 

0.48 

0.51 

0.45 

0.46 

0.51 

0.46 

0.95 

0.76 

0.56 

MBC 

0.71* 

0.55* 

0.51* 

0.45 

0.37 

0.52* 

0.51* 

0.44 

0.50 

0.91 

0.86 

0.57 

EN-KM 

0.45 

0.28 

0.33 

0.44 

0.67 

0.35 

0.39 

0.51 

0.37 

0.74 

0.82 

0.49 

EN-MBC 

0.32 

0.28 

0.35 

0.49 

0.67 

0.35 

0.40 

0.53 

0.39 

0.87 

0.87 

0.50 

HCSL 

0.25 

0.26 

0.31 

0.41 

0.71* 

0.34 

0.38 

0.50 

0.35 

0.57 

0.73 

0.43 

HCAL 

0.85 

0.67 

0.71 

0.75 

0.81 

0.67 

0.72 

0.76 

0.71 

0.96 

0.96 

0.78 

HCCL 

0.51 

0.45 

0.49 

0.55 

0.59 

0.45 

0.50 

0.56* 

0.50* 

0.72 

0.71 

0.55 

ENSL 

0.24 

0.27 

0.31 

0.41 

0.71* 

0.34 

0.38 

0.50 

0.35 

0.57 

0.74 

0.44 

ENAL 

0.88 

0.68 

0.70 

0.69 

0.79 

0.68 

0.71 

0.75 

0.72 

0.96 

0.96 

0.77 

ENCL 

0.55 

0.46 

0.50* 

0.54* 

0.54 

0.46 

0.50* 

0.55* 

0.50* 

0.72 

0.71 

0.55 
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Next, we turn to the comparisons of the methods on real data. We use five different data sets as 
summarized in Table All can be downloaded from the UC Irvine Machine Learning Repository 
( |Lichman~ (2013 ). For these data sets we were able to implement HD, as it gave a reasonable r for 
three of the data sets (ZOO, SOYBEAN - small, and CANCER), so we compared 14 methods rather than 
13. For the remaining two data sets (Mushroom and EYMPHO) we manually found an r that gave cred¬ 
ible results. 


Table 3: Key features of five real data sets. 


Name 

K 

/ 

n 

max^-^ fly 

ZOO 

7 

16 

101 

7 

SOYBEAN-small 

4 

35 

47 

7 

Mushroom* 

2 

21 

400 

7 

Lymphography domain (EYMPHO) 

4 

18 

148 

8 

Primary tumor domain (CANCER) 

21 

17 

339 

21 


This table lists the five data sets used in our analysis giving the true number of clusters in each, the dimension of the data 
sets, the sample sizes, and the maximum number of possible distinct discrete values for each, i.e., the maxj_j aj's for each. 

* Since the dataset is large, we only used the last 400 observations in our analysis. 

The results of our analyses are presented in Table As before we bolded the best methods with 
respect to CR and asterisked the second best methods. Clearly, the best methods were ENCL and 
ENAL and their performances were similar. The next best methods were MBC and, possibly, HD. 

The results in Table also show differences from those in Table Eirst, the ensembling of the 
SL, AL, and CL methods tends to improve them substantially for the real data whereas it has only 
a slight effect on the simulated data. HCAL does well on simulated data, but poorly on the real 
data. ENCL does well on the real data but poorly on the simulated data. Ensembing of K-modes 
and MBC actually makes them worse for the simulated data and only slightly better for the real 
data. We note that ENSL, ENAL and ENCL are in terms of Hamming distance and according to 
Proposition 1, ensembling should routinely give more accurate clusterings so it is no surprise that the 
ensembled versions perform better. Moreover, according to Proposition 2 and the discussion in Sec. 
|3.2[ regardless of accuracy, the ensembled version of a clustering method has less variability, as can 
be seen for K-modes and MBC when compared with their ensembled versions. This is seen in Table 

numbers in parentheses are standard deviations (SD) and the absence of a number in parentheses 
next to an entry indicates an SD of zero. 

The inference from our numerical work is that, in some general sense, the best option is to choose 
ENAL. Recall, ENAL gave the best performance on the simulated data and nearly the best perfor¬ 
mance on the real data - only ENCL is better and only by 0.01 in terms of the means. ENAL performs 
noticeably better on the simulated data than on the real data, possibly because the real data is much 
more complicated. We explain the better performance of ENAL by recalling ^ and observing that 
it is likely to be small for AL for many data sets because AL tends to give compact clusters that are 
stable under ensembling. This follows because AL is relatively insensitive to outliers and an average 
is more stable than an individual outcome. Thus, if two points Xi and X 2 are in the same cluster in 
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Table 4: Classification rates of the proposed and comparison methods using real data. 


method 



Data 



mean 

ZOO 

SOYBEAN 

MUSHROOM 

LYMPHO 

CANCER 

X-modes 

0.72(0.08) 

0.80(0.12) 

0 . 68 ( 0 . 12 ) 

0.45(0.05) 

0.37(0.02) 

0.60 

DBSCAN 

0.88 

1 

0.50 

0.54 

0.32 

0.65 

Rock 

0.87 

1 

0.51 

0.40 

0.15 

0.59 

HD 

0.95 

1 

0.66 

0.59* 

0.28 

0.70 

MBC 

0.88(0.06) 

1 

0.97 

0.47(0.03) 

0.32(0.03) 

0.72 

EN-KM 

0.76 

0.78’^ 

0.98 

0.55 

0.28 

0.67 

EN-MBC 

0.74 

0.79* 

0.97 

0.63 

0.33 

0.69 

HCSL 

0.88 

1 

0.50 

0.57 

0.38 

0.66 

HCAL 

0.89 

1 

0.51 

0.58* 

0.35* 

0.66 

HCCL 

0.90 

1 

0.51 

0.58* 

0.29 

0.66 

ENSL 

0.88 

1 

0.73 

0.57 

0.28 

0.69 

ENAL 

0.89 

1 

0.97 

0.58* 

0.38 

0.76 

ENCL 

0.90 

1 

0.97 

0.64 

0.35* 

0.77 


a clustering of size Ki, it is unlikely that reasonable AL reclusterings of size K 2 will put Xi and X 2 in 
different clusters even when K 2 is far from Ki. Hence, AL is likely to put Xi and X 2 in the same cluster 
even for clusterings of different sizes and this will be preserved under ensembling, cf. the remarks 
after Q. 


4 Extension to High Dimensional Vectors 


Extending any clustering technique to high dimensions must evade the Curse of Dimensionality to 
be effective. The way the Curse affects clustering, loosely speaking, is to make the distance between 
any two vectors nearly the same. This has been observed in work due to Hall et al. ( 2005| but the 
basic idea can be found in Murtagh ( 2004| . The earliest statement of the Curse of Dimensionality 
in a clustering context seems to be Beyer et al. ( 1999| |. However, all of these are in the context of 
clustering continuous, not categorical, variables. Clustering categorical variables is fundamentally 
different because in categorical clustering there is usually no meaningful way to choose a distance 
whose numerical values correspond to physical distances. For instance, in DNA one could assign 
values 1, 2, 3, and 4 to A, T, C, and G. However, it is not reasonable to say A and C are twice as far 
apart as A and B. For this reason, Hamming distance, which merely indicates whether two outcomes 
are the same, is preferred. 


4.1 Ensembling over subspace clusterings 

Our first result states a version of the Curse of Dimensionality for the clustering of high dimensional 
categorical data. 

Proposition 3. Let X = (Xi,...,Xj)^ and X' = (X^,...,Xp^ be independent J dimensional categorical 
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random variables. Let 


1 ^ 

d{X,X') = -J^6{Xj,X'j) 


Then, as J ^ oo, 


;=i 


E{d{X,X'))^C and Var{d{X,X')) ^ 0, 


for some C > 0. In particular, if Xj and X'- have the same distribution for all j, there is a p > 0 so that 
E{ 8 {Xj,X'.)) = p and 


E(d(X,X')) = p and Var(d(X,X')) 


P(1 -P) 

J 


The proof is little more than the law of large numbers and heuristically means that in the limit of 
high dimensions any two independent vectors of categorical variables are equidistant. The rate of de¬ 
crease in variance controls how quickly this occurs. Hence, clustering based on distances necessarily 
degenerates or, more precisely, the output of a clustering procedure will be random in the sense that 
no one clustering can reasonably be favored over another. Prop, [^is a discrete analog of Kriegel et al. 
[ (20091, Steinbach et al. | ( '2004| , and Parsons et al. ( 2004| |. 

The two main ways analysts evade the Curse in high dimensional clustering with continuous data 
are subspace clustering, i.e., find a clustering on a subset of the variables and Tift' it to a clustering of 
the entire vector, and feature selection, in which clustering is done on relatively few functions of the 
variables thought to characterize the clusters in the data. 

Evidence has accumulated that feature selection does not work very well on continuous data - 
absent extensive knowledge about which features are relevant - see Yeung and Ruzzo ( 2001| |, Chang 
(1983 ', Witten and Tibshirani || (2010[|. Indeed, it can be verified that if generic methods for obtaining 
features, e.g., PCA, are used with categorical data, the computing demands become infeasible. Since 
techniques based on feature selection are even harder to devise and compute for discrete data, feature 
selection does not seem a promising approach to high dimensional clustering of categorical data. 


However, various forms of subspace clustering have shown some promise (Friedman and Meulman 
( 2004| |, Jing et al. ( 2007| , Witten and Tibshirani ( 2010| |, and Bai et al. ( |2011 ). The common feature 
of these subspace methods is that they seek to identify a single subspace from which to generate a 
good clustering. 

We refine the idea of subspace clustering by adding a layer of ensembling over randomly chosen 
subspaces. That is, we randomly generate subspaces, derive clusterings for each of them, and then 
combine them analogous to the procedure that generated ([^. Our procedure is as follows. 

Recall that the vector of categorical variables is x e Ai x ... x Aj and write J = hR for some integers 
h and R. Then, each x can be partitioned into R subvectors of length h. So, each data point of the 
form X gives R submatrices of dimension nxh and the whole data set xi,..., x„ can be represented by 
matrices of the form 

^l,{r-l)h+l ■■■ ^l,rh 


Lr{n) 


An,{r-l)h+l 


^n,rh, 
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Now, each can be clustered by any of the techniques in Sec. 3.1. 

Recall, the conclusion of Sec. 3.1 was that ENAL gave the best results overall. Let the result of, say, 
ENAL on !,.(«) be denoted C{Kj.) = {Cri,...,CrK^}, for r = 1,...,R. This gives R clusterings of the full 
data set denoted by 

{C(Ri),...,C(Rr)}. 

These clusterings can be combined by using the dissimilarity in Q and then any hierarchical method 
can be used to obtain final clusters. In our computing below, we chose AL because it did slightly 


better than CL in Sec. 3.3 In our work below, we refer to this method as subspace ensembling. 

This basic template can obviously be extended to the case where the equi-sized sets of variables 
are chosen randomly. This is an improvement because it means that variables do not have to be 
adjacent to contribute to the same clustering. One obvious way to proceed is to choose random subsets 
of size h without replacement until R subsets are obtained. In addition, h itself can vary across subsets 
in which case the requirement that J = hR is replaced by the requirement that J = + ■■■ + hy for 

some randomly chosen v. We denote this form of subspace ensembling by WOR. Note that even if 
two variables are useful in the same clustering they will rarely be chosen to contribute to the same 
clustering. A second way to proceed is where h varies across subsets but the subsets of variables are 
chosen with replacement and then duplicate variables are removed from each Lj.(n); we denote this 
method of subspace ensembling by WR. This permits variables that contribute usefully to the same 
clustering to have a fair chance of being used in the same clustering. On the other hand, WR does 
not require each variable to contribute to the overall clustering. Eor computational efficiency, the 
WR method we used had an extra layer of subsampling and discarding of repeated variables. So, if 
we denote by Ly{n)' the result of removing duplicates from L^{n) as described, we again chose with 
replacement and removed duplicates from each Ly{n)'. This ensured that the clusters in the ensemble 
were based on few enough variables that results could be found with reasonable running times. One 
way to see that this double resampling approach is likely to reduce the dimension of each subspace is 
demonstrated by the following. 

Proposition 4. Suppose a bootstrap sample of size J is taken from a set of] distinct objects denoted {1,...,/}. 

1) Let N be the number of distinct elements from {1,...,/} in such a sample. Then for I <k <J, 


P(N = k) 


({)(!:. . 




( ^ )) 

Vai ... ai,>l 


V 


2) Non; suppose a second level of bootstrap sampling is done on the N = k distinct elements. Let N* be 
the number of distinct elements in the second level bootstrap sample. Then, for I <k* <k, 


P{N* = k*\N = k) 




kk 
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Proof. For 1) the cardinality of the event [N = k} is 


#( ways to get k distinct elements from / draws from / distinct objects) 


= #( subsets of size k from a set of / distinct objects) 
X #( ways to allocate the / draws to the k values ) 



Now, the expression for P(N = fc) in 1) follows by dividing ( |16[ l by , the number of points in the 
sample space. For 2), merely note that, conditional on N = k, the probability that N* = k* is the same 
as ( [T^ but with / replaced by k and k replaced by fc*. □ 

Clearly, N* < N and so E{N*) < E{N). Also, expressions for E{N) and E{N* \ N = k) follow from 
Prop. 4 but are not edifying. Flowever, given these. 


E(N*) = '^'^k*P(N* = k*\N = k)P{N = k) 


k=i t=i 


is relatively easy to find. The value of E(N)/J for / > 1000 is about 0.63 in agreement with the fact 
that the probability of a given x, not being chosen is (1 - l/n)" 1/e as n ^ oo and .63 « 1 - (1/e). For 

the double bootstrapping, E{N*)/J « 0.47 can be verified computationally. 


4.2 Numerical evaluation of subspace ensembling 


We compare four methods using simulated data and two real data sets. The four methods are the pro¬ 
posed techniques WR and WOR with M = 200 ensembles, the mixed weighted X-modes {MWKM) 
method due to Bai et al. |2011[|, and the sparse hierarchical method {SEIM) due to Witten and Tib- 


shirani (2010|. Bai et al. (2011|| only tested their method on dimensions up to 68; we use their 


method as implemented in code supplied by the authors for 4027, 44K, and 50k dimensions. SHM 
was intended for continuous data; we implemented a version of code supplied by the authors that we 
modified for categorical data to use Flamming distance. Unfortunately, in our computations SHM 
performed poorly so the results are not presented here. 


4.2.1 Simulations 

Our simulation study involves data that mimics genetic sequence data. Consider a sample of size 
n = fij where K = 5 and is the number of observations from the i-th cluster, regarded as if 
it were a population in its own right as well as a component in the larger population from which 
the sample of size n was drawn. We assume there are / = 50,000 categorical variables each drawn 
from the set S = {A, T, C, G}. Then, to simulate data, it remains to assign distributions to the 50,000 
variables. 
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First, we consider data sets denoted Vi and formed as follows. Suppose the first qi dimensions 
for Hi samples are drawn IID from the set E = {A,T,C,G] where Pq = Pq - ^ the 

first qi dimensions for the other subsamples rij, i = 2,...,5 are drawn IID from E uniformly. Similarly, 
for the next q 2 dimensions, for the n 2 samples draw values using the same distribution as for the 
first qi variables for tii and use the uniform on E for nj, ^ 3 ,^ 4 ,^ 5 . Repeat this procedure to fill in 
the q^, and q^ dimensions for the n data points so the data have K = 5. To add noise to data, 
we take the last q(, variables to be IID uniform from E as well. For the purposes of evaluating the 
various methods, we choose the number of variables randomly by setting qi,...,q^~ Mnom(50000,p = 
(0.15,0.15,0.15,0.15,0.15,0.25)). To generate V 2 we increased the number of noise variables by taking 
qi,...,q(^ ~ Mnom(50000,p = (0.1,0.1,0.1,0.1,0.1,0.5)). 

Tablej^shows the mean CRs for WOR, WR, and MWKM for 500 data sets of the form Dj and T >2 
for each setting of the numbers of samples in the five clusters. Obviously, WR substantially outper¬ 
forms the other methods. We attribute this to the fact that WR permits variables that are related to 
each other to recur in the ensembling so that the basic method from Sec. [^can find the underlying 
structure in the subspaces. Note also that the performance of WR is insensitive to the cluster size. 


Table 5: Classification rates of three methods using simulated data. 



n = ( 10 , 10 , 10 , 10 , 10 ) 

n = (5,10 

,10,10,15) 

method 

TPi 

V 2 


V 2 

WOR 

0.501 

0.473 

0.564 

0.488 

WR 

0.998 

0.989 

0.997 

0.987 

MWKM 

0.615 

0.583 

0.615 

0.587 


n = {5,l 

i, 13,13,14) 

n = (5,5, 

10,15,15) 

WOR 

0.504 

0.589 

0.531 

0.589 

WR 

0.974 

0.998 

0.978 

0.996 

MWKM 

0.606 

0.574 

0.6128 

0.578 


n = (5, 

5,5,17,18) 

n = (5,5 

,5,10,25) 

WOR 

0.566 

0.678 

0.622 

0.692 

WR 

0.977 

0.996 

0.968 

0.995 

MWKM 

0.597 

0.577 

0.555 

0.541 


m = (5,^ 

i,l 0 , 10 , 20 ) 

n = (5,5,5,5,30) 

WOR 

0.552 

0.631 

0.688 

0.752 

WR 

0.976 

0.998 

0.962 

0.995 

MWKM 

0.617 

0.570 

0.514 

0.487 


4.2.2 Asian rice (Oryza saliva) 

We consider a population consisting of five varieties of rice (Oryza sativa) and use clustering on single 
nucleotide polymorphism (SNP) data to assess the plausibility of the division of the species into five 
varieties. The data, RICE, were originally presented and analyzed in Zhao et al. (20111 and consist of 
391 samples from the five varieties indica (87), aus (57), temperate japonica (96), aromatic (14) and 
tropical japonica (97) where the numbers in parentheses indicate the number of samples from each 
variety. 
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Figure 1: Circular dendrogram generated by the same technique as in Zhao et al. | ( |201 l) (left) and WR 
(right), for the RICE data. Green, blue, red, yellow, and black indicate indica, aus, temperate japonica, 
aromatic and tropical japonica, respectively. 


The analysis done in Zhao et al. (20111 was to measure the genetic similarity between individuals. 
Essentially, Zhao et al. |2011| | calculate the proportion of times a pair of nucleotides at the same po¬ 
sition differ. Mathematically, this is equivalent to using a version of the average Hamming distance. 
Note that in their analysis they ignored missing values as is permitted in PLINK, [Purcell et al. ( 2007| |. 
Setting K = 5 we regenerated their analysis and dendrogram. The result is shown on the left hand 
panel of Fig. Around the outer ring of the circle the correct memberships of the data points are 
indicated. The CR for this clustering is one. 


For comparison, the right hand panel in Fig. [^shows the dendrogram for clustering the RICE data 
using WR, again setting K = 5. The CR was found to be one. No dendrogram for MWKM can be 
shown because it is not a hierarchical method. However, the CR for MWKM was 0.87, making it 
second best in performance. Even though PFINK and WR have the same CR, visually it is obvious 
that WR gives the better dendrogram because the clusters are more clearly separated. That is, WR 
does not perform better in terms of correctness but does provide a better visualization of the data. 
This is the effect of ensembling over dissimilarity matrices. 


4.2.3 Gene expression data 


In this example we demonstrate that the performance of WR can be regarded as robust. Consider the 
gene expression data presented and analyzed in Alizadeh et al. ||2000||. It is actually classification 


data and analyzed as such in Dudoit et al. (2002|. Here, for demonstration purposes we compare 
some of the clusterings we can generate to the known classes so as to find CRs. The sample size is 
62 and there are three classes: diffuse large B-cell lymphomas (D) with 42 samples, follicular lym¬ 
phoma (F) with 9 samples, and chronic lymphocytic leukemia (C) with 11 samples. The dimension of 
the gene expression data after pre-processing is 4026. (The pre-processing included normalization, 
imputation, log transformation, and standardization to zero mean and unit variance across genes.) 

First we applied K-means to the data 1000 times with K = 3 and random starts, and found an 
average CR of 0.79 with a standard error of 0.13. When we applied MWKM to the data, we found its 
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CR to be 0.67, noticeably worse than R-means. 

Now consider the following procedure. Trichotomize all 4026 variables by using their 33% and 
66 % percentiles and relabel them as '1', ‘2’, and '3'. Then, apply WOR and WR to the discretized 
data. To get CRs for hierarchical methods, we must convert their dendrograms to clusterings by 
cutting them at some level. To do this, we used the function cuttree (with X = 3) in R. The resulting 
CRs for WOR, WR were 0.71 and 0.84, respectively. That is, even when the data are discrete only 
because they were made that way artificially, WR handily outperformed three other methods. 



Figure 2: Dendrogram of WR. 

The full dendrogram from WR is given in Fig. Along the horizontal axis, the correct labels of 
the classes are given. If these were ignored and one were merely to eyeball the data, one could be led 
to put the two rightmost D's and the first two F's into the cluster of C's, giving an CR of 58/62 = .93 
- much higher than .84. One could just as well put the leftmost D's into one cluster, the next 13 D's 
into a second cluster, the next eight D's into a third cluster, and the rightmost 18 observations into a 
fourth cluster, leaving the intervening data points essentially as a fifth cluster that does not cohere. 
In this case, the CR would be terrible. So, even though the data are artificially discretized, using an 
automated method of WR on a discretized WR and cutree gives a result in the midrange of what 
informal methods would give. This is evidence that ensemble methods such as WR are inherently 
robust. Otherwise put, reading dendrograms informally can be misleading whereas formal methods 
may be reliably accurate. 

5 Extension to high dimensional vectors of unequal length 

We extend our method to clustering categorical vectors of different lengths. This is an important 
clustering problem in genomics because it is desirable to be able to cluster strains of organisms, for 
instance, even though their genomes have different lengths in terms of number of nucleotides. The 
first step is to preprocess the data so all the vectors have the same length. This process is called 
alignment. The aligned vectors can then be clustered using the technique of Sec. The point of this 
section is to verify that our clustering method is effective even after alignment. 

To be specific, consider sequence data of the form W,- = (x,jX,' 2 ...^j 7 ) with i = l,...,n in which each 
Xij for j = 1,. is a nucleotide in {A, T, C, G}. It is obvious that a sufficient condition for our method 
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to apply is that all the xij's assume values in sets Aj for which #{Aj) is bounded. To find a common 
value for the n sequences we align them using software called MAFFT-7 (Katoh and Standley (2013 ). 
MAFFT-7 is a multiple alignment program for amino acid or nucleotide sequences. The basic pro¬ 
cedure was first presented in Needleman and Wunsch (1970 '. A recent comparison of algorithms 
and software for this kind of alignment problem was carried out in Katoh and Standley ( 2013| who 
argued that MAFFT-7 is faster and scales up better than other implementations such as CLUSTAL and 
MUSCLE. 


At the risk of excessive oversimplification, the basic idea behind alignment procedures is as fol¬ 
lows. Suppose two sequences x, and xj. of different lengths are to be aligned. Then, the alignment 
procedure introduces place holders represented by (p so that the two sequences are of the same length 
and the subsequences that do match are in the same place along the overall sequence. When more 
than two sequences must be aligned, a progressive alignment can be used, i.e., two sequence are 
aligned and fixed, the third one aligned to previous ones, and the procedure continues until all se¬ 
quences are aligned. Given that a collection of genomic sequences have been aligned, we can cluster 
them by applying our technique. 

In the absence of established theory for this more complicated case, we present two examples 
to verify that the procedure gives reasonable results. Both of our examples concern viruses: Their 
genomes are large enough to constitute a nontrivial test of our clustering method and of different 
enough in lengths from species to species that alignment of some sort is necessary. 

As a first simple example, consider the virus family Filovirdae . This family includes numerous 
related viruses that form filamentous infectious viral particles (virions). Their genomes are repre¬ 
sented as single-stranded negative-sense RNAs. The two members of the family that are best known 
are Ebolavirus and Marburgvirus. Both viruses, and some of their lesser known relatives, cause severe 


disease in humans and nonhuman primates in the form of viral hemorrhagic fevers (see Pickett et al. 

[|( [20l2| )). 


There are three genera in Filoviridae, and we chose as our data set all the complete and distinct viral 
genomes with a known host from this family available from ViPR. There were 103 in total from 3 gen¬ 
era, namely, Cuevavirus (1, Cue), Ebolavirus (80), and Marburgvirus (22, Mar) where the indicators in 
parentheses show the frequency and the abbreviation. Ebolavirus further subdivides into five species: 
Bundibugyo virus (3, Bun), Reston ebolavirus (5, Res), Sudan ebolavirus (6, Sud), Tai Forest ebolavirus (2, 
Tai), and Zaire ebolavirus (64, Zai). The hosts are human, monkey, swine, guinea pig, mouse, and bat 
(denoted hum, mon, swi, gpi, mou, bat, respectively, on the dendrograms). The minimum and max¬ 
imum genome lengths are 18623 and 19114. While we recognize that the genomes in the pathogen 
virus resource are not drawn independently from a population, we can nevertheless apply our method 
and evaluate the results. 

We apply only WR as W OR and MW KM performed poorly on high dimensional data. In addition, 
we used HCSL, HCCL, and HCAL because we had a dissimilarity that could be used after alignment. 
Essentially, as long as the distance between (p and the nucleotides could be omitted from the Hamming 
distance sum, the dissimilarity was well-defined. It turned out that all three gave nearly identical 
results although FI CAL was slightly better. The important point is that WR performed better than 
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the three non-ensemble methods because the ensembling over dissimilarity matrices gives a better 
assessment of the distance between aligned genomes than not ensembling. 

The results for HCAL are shown in the Fig. It is seen that the subpopulations are well sep¬ 
arated. Virologically speaking, this means that the various species correspond to relatively tight 
clusters. Within the Marburg cluster (on the left) it appears that most of the genomes have either 
humans or bats as hosts suggesting that one organism (probably the bats) is transmitting Marburg 
to the other (probably human). Sudan ebolavirus mainly afflicts humans (pending more data) while 
Reston ebolavirus is known not to be a pathogen for humans. The vast majority of the Zaire and 
Bunibugyo genomes have human as host. 

Fig. shows the corresponding dendrogram using WR, an ensemble method. Qualitatively the 
results are the same as for HCAL. The improvement of WR over HCAL is seen in the fact that WR 
reveals greater separation between the clusters. Indeed, even if one corrects for the vertical scale, 
the leaves within a cluster under WR separate from each other at a much finer level. That is, the 
ensembling over the dissimilarities accentuates the differences between genomes in different clusters 
as discussed in Sec. [Q 



Figure 3: Dendrogram for the Liloviridae data set using HCAL. The leaves are labeled with the Genus 
or species and host. 



Figure 4: Dendrogram for the Liloviridae data set using WR under the same labeling convention as 
the Fig. 


These dendrograms can be contrasted with a phylogenetic tree for the Liloviridae viruses. Figure 
shows the phylogenetic tree generated by the neighbor-joining (NJ) method as implemented in the 
R package Ape |Paradis et al. | |2004| ). The NJ method constructs a tree by successive pairing of 
the neighbors. The idea behind a phylogenetic tree, as opposed to a dendrogram, is to represent 
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the sequence of evolutionary steps through which organisms mutated as a reasonable way to classify 
the existing and extinct organisms. The goals of the two sorts of trees are somewhat different and 
one would not expect them to agree fully, since clustering only gives a mathematically optimal path 
to the evolutionary endpoint while phylogenetic trees try to track genomic changes. For instance, 
the phylogenetic tree shows that Zaire ebolavirus with a human host separates early into two distinct 
groups which may or may not be reasonable evolutionarily and is different from Fig. Tai Forest 
ebolavirus and Bunidbugyo virus genomes are seen to be possibly close evolutionary but are not close 
in Fig. On the other hand, Sudan ebolavirus and Reston ebolavirus are seen to be close in terms of 
both clustering and phylogenetics while Marburg is a separate and recent genus, consistent with it 
being its own cluster. 



Figure 5: Phylogenetic tree generated for Filoveridae by NJ. It is broadly similar to the dendrograms 
in Figs. [^and[^ but differs in some details. 

As a second and more complicated example, we studied the Herpesviridae family of viruses that 
cause diseases in humans and animals. Herpesviridae is a much larger family than Filoviridae and the 
genomes in Herpesviridae are generally longer as well as more varied in length than those in Filoviri¬ 
dae. According to ViPR, the family Herpesviridae is divided into three subfamilies {Alphaherpesviri- 
nae, Betaherpesvirinae and Gammaherpesvirinae). We limited our analysis to the distinct and complete 
genomes in Alphaherpesvirinae that have known hosts; Alphaherpesvirinae has more complete genomes 
than either Betaherpesviridae or Gammaherpesvidae. Within Alphaherpesviridae there are has five gen¬ 
era: Iltovirus (lit), Mardivirus (Mar), Scutavirus, Simplexvirus (Sim), and Varicellovirus (Var). Since 
Scutavirus did not have complete any complete genomes, we disregarded this genus. The rest remain¬ 
ing genera had 20, 18, 20, 40 genomes, respectively, from different hosts, namely, human. Monkey, 
chicken, Turkey, Duck, cow. Bat (Fruit), equidae (horse). Boar, Cat family, Amazona oratrix (denoted 
hum, mon, chi, tur, due, cow, bat, equ, boa, cat, and ora, respectively in the dendrograms). These viral 
genomes have lengths ranging from 124784 to 178311 base pairs. 

Parallel to the Ebolavirus example, we present the two dendrograms corresponding to HCAL and 
WR. These are in Figj^andj^ The top panel shows that HCAL, the non-ensembled version based on 
Flamming distance, is qualitatively the same as the lower panel. As before, the key difference is that 
WR yields a cleaner separation of clusters relative to HCAL. It is important to note that the clusters in 
the dendrograms correspond to (genus, host) pairs. That is, the clustering corresponds to identifiable 


24 

















physical differences so the clusters have a clear interpretation. 
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Figure 6: Dendrogram for the Alphaherpesviridae data set using HCAL. The leaves are labeled with 
the genome Genus or subpopulation and host. 
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Figure 7: Dendrogram for the Alphaherpesviridae data set using WR under the same labeling conven¬ 
tion as the top panel. 



Figure 8: Phylogenetic tree generated for Alphaherpesviridae. It is broadly similar to the dendrograms 
in Figs. [^and[^ but differs in some details. 

The same cannot be said for the phylogenetic tree generated by NJ as before and shown in Fig. 
For instance, Varicellovirus from host equidae are partitioned into two clusters early. Also, Mardivirus 
from chicken and duck hosts are not cleanly separated. On the other hand, most other population- 
host pairs are fairly well separated. Overall, Fig. [^does not give as good a clustering as the panels in 
Figs. [^and[^ 

6 Conclusions 

In this paper we have presented a method for clustering categorical data in low, high, and varying 
dimensions. We began with relatively small dimensions, up to 35 for the SOYBEAN data, and studied 
the way our method seemed to improve over other methods. Specifically, we ensembled over dis¬ 
similarity matrices in an effort to represent the distance between data points more accurately. Our 
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theoretical work in Sec. |3.2| provides some formal justification for why this sort of technique should 
perform well in some generality. 

Then we turned to the clustering of high dimensional categorical data, focusing on genomic data. 
We extended our ensemble method for low dimensional data to high dimensional categorical vectors 
of equal length by adding a layer of ensembling: We obtained dissimilarity matrices by ensembling 
over randomly selected dimensions. We then used our method on categorical vectors of different 
lengths by artificially making them the same length through alignment procedures. Again, our en¬ 
sembling method performed better than the other methods we tested. In particular, we compared the 
output of our method in this case to phylogenetic trees. While not strictly scientific, the dendrograms 
we generated can be interpreted physically and differ in some important respects from phylogenetic 
trees generated from the same data. 

Throughout we have used a large number of simulated and real data examples to buttress the 
intuition behind the technique and formal results. We comment that there are many other tests of the 
general methodology that could be done. For instance, in our clustering of viral genomes we could 
have included incomplete genomes. However, it many cases the incomplete genomes had over 90% 
of the nucleotides missing and we thought this insufficient for good conclusions. 
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